A statistical information extraction system for Turkish

Authors

  • Gökhan Tür
  • Dilek Z. Hakkani-Tür
  • Kemal Oflazer
Abstract

This paper presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. In languages like English, a given root word has only a small number of possible word forms. Languages like Turkish, however, have very productive agglutinative morphology. Building statistical models for specific tasks from the surface forms of the words is therefore difficult, mainly because of the data sparseness problem. To alleviate this problem, we used additional syntactic information, i.e. the morphological structure of the words. We have successfully applied statistical methods using both the lexical and morphological information to sentence segmentation, topic segmentation, and name tagging tasks. For sentence segmentation, we modeled the final inflectional groups of the words, combined this model with the lexical model, and decreased the error rate to 4.34%, which is 21% better than the result obtained using only the surface forms of the words. For topic segmentation, stems of the words (especially nouns) were found to be more effective than the surface forms, and we achieved a segmentation error rate of 10.90% on our test set according to the weighted TDT-2 segmentation cost metric. This is 32% better than the word-based baseline model. For name tagging, we used four different information sources to model names. Our first information source is based on the surface forms of the words. We then combined the contextual cues with the lexical model and obtained some improvement. After this, we modeled the morphological analyses of the words, and finally we modeled the tag sequence, reaching an F-measure of 91.56% according to the MUC evaluation criteria. Our results are important in that they show that using linguistic information, i.e. morphological analyses of the words, together with a corpus large enough to train a statistical model, significantly improves these basic information extraction tasks for Turkish.
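
The abstract reports its gains as relative improvements in error rate and as an F-measure. As a minimal sketch, assuming that "21% better" denotes a relative reduction in error rate and that the F-measure is the usual weighted harmonic mean of precision and recall (the abstract spells out neither), the two quantities could be computed as follows. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error improves on baseline_error.

    Assumption: "X% better" is read as (baseline - new) / baseline.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 is balanced)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the arithmetic.
    print(relative_error_reduction(10.0, 8.0))
    print(f_measure(0.9, 0.9))
```

Under this reading, an error rate falling from a hypothetical 10% to 8% is a 20% relative improvement, even though the absolute drop is only two points.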


Similar resources

Developing a Concept Extraction System for Turkish

In recent years, due to the vast amount of available electronic media and data, the need to analyze electronic documents automatically has increased. In order to assess whether a document contains valuable information, the concepts, key phrases, or main idea of the document have to be known. There are some studies on extracting key phrases or main ideas of documents for Turkish. However, to...


Name Tagging Using Lexical, Contextual, and Morphological Information

This paper presents a probabilistic model for automatically tagging names in a Turkish text. We used four different information sources to model names, and successfully combined them. Our first information source is based on the surface forms of the words. We then combined the contextual cues with the lexical model and obtained a significant improvement. After this, we modeled the mor...


Design and evaluation of an ontology based information extraction system for radiological reports

This paper describes an information extraction system that extracts the information available in free-text Turkish radiology reports and converts it into a structured information model using manually created extraction rules and a domain ontology. The ontology provides flexibility in the design of extraction rules, and determines the information model for the extracted semantic information. Although...


The Use of Remote Sensing and GIS Technologies for Comprehensive Wastewater Management

The present study aimed to combine remote sensing technology, geographic information systems, and statistical data in order to provide a rapid, sensitive, and comprehensive overview of the current situation of Turkish river basins in terms of existing spatial data. For the aim of the study, all statistical information gathered from the national authorities on a regional basis was ove...


A Turkish Handprint Character Recognition System

This paper presents a study on recognizing isolated Turkish handwritten uppercase letters. In the study, a Turkish Handprint Character Database was first created from samples written by students at Istanbul Technical University (ITU). There are about 20000 uppercase and 7000 digit samples in this database. Several feature extraction and classification techniques are implemented and combined to find...


Journal title:
  • Natural Language Engineering

Volume 9  Issue

Pages  -

Publication date 2003